Communication-avoiding parallel and sequential QR factorizations

Authors

  • James Demmel
  • Laura Grigori
  • Mark Hoemmen
  • Julien Langou
Abstract

We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case) and data movement between slow and fast memory (in either the sequential or parallel case). Our first algorithm, Tall Skinny QR (TSQR), factors m × n matrices in a one-dimensional (1-D) block cyclic row layout, storing the Q factor (if desired) implicitly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires no more than the minimum number of messages, Θ(log P), between P processors. In the sequential case, TSQR transfers 2mn + o(mn) words between slow and fast memory, which is the theoretical lower bound, and performs Θ(mn/W) block reads and writes (as a function of the fast memory size W), which is within a constant factor of the theoretical lower bound. In contrast, the conventional parallel algorithm as implemented in ScaLAPACK requires Θ(n log P) messages, a factor of n times more, and the analogous sequential algorithm transfers Θ(mn²) words between slow and fast memory, also a factor of n times more. TSQR uses only orthogonal transforms, so it is just as stable as standard Householder QR. Both parallel and sequential performance results show that TSQR outperforms competing methods. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization, which removes a latency bottleneck in ScaLAPACK’s current parallel approach, as well as both bandwidth and latency bottlenecks in ScaLAPACK’s out-of-core QR factorization. CAQR achieves modeled speedups of 2.1× on an IBM POWER5 cluster, 3.0× on a future petascale machine, and 3.8× on the Grid.
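The reduction idea behind TSQR can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python illustration, assuming a binary reduction tree, contiguous row blocks (rather than a block cyclic layout), and that only the R factor is wanted. The names `householder_r`, `tsqr_r`, and `gram` are hypothetical helpers introduced for this example. Each row block is factored locally, then pairs of small R factors are stacked and refactored, so the number of reduction rounds grows like log2 of the block count.

```python
import math
import random


def householder_r(a):
    """Return the n x n upper-triangular R factor of a (rows >= cols) via Householder QR."""
    r = [row[:] for row in a]  # work on a copy so the input is left untouched
    m, n = len(r), len(r[0])
    for k in range(min(m - 1, n)):
        # Build the Householder vector v that zeroes column k below the diagonal.
        norm_x = math.sqrt(sum(r[i][k] ** 2 for i in range(k, m)))
        if norm_x == 0.0:
            continue  # column is already zero from row k down
        alpha = -math.copysign(norm_x, r[k][k])
        v = [r[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(x * x for x in v)
        # Apply H = I - 2 v v^T / (v^T v) to columns k..n-1.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * r[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                r[k + i][j] -= s * v[i]
    # Keep the top n rows and clear rounding noise below the diagonal.
    return [[r[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_r(a, num_blocks):
    """Return (R, rounds): the R factor of a from a binary reduction tree over row blocks."""
    m, n = len(a), len(a[0])
    rows_per_block = m // num_blocks
    assert rows_per_block >= n, "each block needs at least n rows"
    # Leaf step: each block (standing in for one processor) factors its own rows.
    bounds = [i * rows_per_block for i in range(num_blocks)] + [m]
    rs = [householder_r(a[bounds[i]:bounds[i + 1]]) for i in range(num_blocks)]
    # Reduction: stack pairs of R factors and refactor the 2n x n result.
    rounds = 0
    while len(rs) > 1:
        nxt = [householder_r(rs[i] + rs[i + 1]) for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:
            nxt.append(rs[-1])  # an unpaired factor passes through to the next round
        rs = nxt
        rounds += 1
    return rs[0], rounds


def gram(x):
    """Return x^T x, used here only to check the result."""
    n = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(n)] for i in range(n)]


random.seed(0)
A = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(64)]
R, rounds = tsqr_r(A, num_blocks=4)
```

Because every step applies only Householder reflections, the final R satisfies RᵀR = AᵀA up to rounding, which `gram(R)` and `gram(A)` can be used to check. In a real parallel setting each stacking step would correspond to one message between a pair of processors; the sketch only mimics that structure in a single process.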

Related resources

Implementing Communication-optimal Parallel and Sequential QR Factorizations

We present parallel and sequential dense QR factorization algorithms for tall and skinny matrices and general rectangular matrices that both minimize communication, and are as stable as Householder QR. The sequential and parallel algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example up t...

Communication Avoiding Rank Revealing QR Factorization with Column Pivoting

In this paper we introduce CARRQR, a communication avoiding rank revealing QR factorization with tournament pivoting. We show that CARRQR reveals the numerical rank of a matrix in an analogous way to QR factorization with column pivoting (QRCP). Although the upper bound of a quantity involved in the characterization of a rank revealing factorization is worse for CARRQR than for QRCP...

Communication-optimal parallel and sequential QR and LU factorizations: theory and practice

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m × n matrices in a one-dimensional (1-D) block cyclic row layout, and is optimized for m ≫ n. Our second algorithm, CAQR (Communication-Avoi...

Communication-optimal Parallel and Sequential QR and LU Factorizations

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds for the number of multiplications done by “non-Strassen-like” QR, and using these in known communication lower bounds that are proportional to ...

MATHEMATICAL ENGINEERING TECHNICAL REPORTS CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System

Designing communication-avoiding algorithms is crucial for high performance computing on a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm for computing a tall-skinny QR factorization, and TSQR is known to be much faster than, and as stable as, the classical Householder QR algorithm. The Cholesky QR algorithm is another very simple and fast communication-avoiding a...

Journal title:
  • CoRR

Volume abs/0806.2159

Publication date 2008